
Bioinformatics Advances

Oxford University Press (OUP)

All preprints, ranked by how well they match Bioinformatics Advances's content profile, based on 184 papers previously published here. The average preprint has a 0.16% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
CilioGenics: an integrated method and database for predicting novel ciliary genes

Pir, M. S.; Yenisert, F.; Karaman, A.; Begar, E.; Tsiropoulou, S.; Firat-Karalar, E. N.; Blacque, O. E.; Oner, S. S.; Doluca, O.; Cevik, S.; Kaplan, O. I.

2023-04-02 bioinformatics 10.1101/2023.03.31.535034 medRxiv
Top 0.1%
44.8%

Discovering the entire list of human ciliary genes would help in the diagnosis of cilia-related human disorders known as ciliopathies, but at present the genetic diagnosis of many ciliopathies (over 30%) is far from complete (Bachmann-Gagescu et al., 2015; Knopp et al., 2015; Paff et al., 2018). In theory, many independent approaches may uncover the whole list of ciliary genes, but 30% of the genes on the ciliary gene list are still ciliary candidate genes (van Dam et al., 2019; Vasquez et al., 2021). All of these cutting-edge techniques, however, have relied on a different single strategy to discover ciliary candidate genes. Because different methodologies demonstrated distinct capabilities with varying quality, categorizing the ciliary candidate genes in the ciliary gene list without further evidence has been difficult. Here, we present a method for predicting the ciliary capacity of each human gene that incorporates diverse methodologies (single-cell RNA sequencing, protein-protein interactions (PPIs), comparative genomics, transcription factor (TF)-network analysis, and text mining). By integrating multiple approaches, we reveal previously undiscovered ciliary genes. Our method, CilioGenics, outperforms other approaches that are dependent on a single method. Our top 500 gene list contains 256 new candidate ciliary genes, with 31 experimentally validated. Our work suggests that combining several techniques can give useful evidence for predicting the ciliary capability of all human genes.

2
Can large language models reliably extract human disease genes from full-text scientific literature?

Yin, D.; Leung, M. K. S.; Pun, D. W. H.; Chen, F. H.; Kwon, J. Y.; Lin, X.; Ho, J. W. K.

2025-07-31 bioinformatics 10.1101/2025.07.27.667022 medRxiv
Top 0.1%
36.1%

Manual extraction of high-fidelity gene-disease-phenotype information from human genetics literature is a labor-intensive task that requires trained human genetics researchers to read through many primary research papers. This presents a major challenge for maintaining up-to-date human disease genetic databases. Recent exploration into large language models (LLMs) opens new directions in automating this manual process. However, most approaches depend on pre-training, finetuning, or specialized generative artificial intelligence (GenAI) tools, and there is a lack of empirical evidence to show whether commercially available LLMs can be directly used to reliably extract gene-disease-phenotype information for human genetic diseases. Herein, we perform a benchmark of the use of three zero-shot prompted LLMs, namely GPT-4, DeepSeek and Claude, without task-specific fine-tuning, to extract human genetic information directly from the full text of scientific papers. Using known congenital heart disease (CHD) genes found in the open access CHDgene database (https://chdgene.victorchang.edu.au/) as the benchmark data set, GPT-4o achieved overall 88.8% extraction accuracy across 23 gene entries containing over 57 references, with 100% accuracy in gene name, and 78.3% and 76.7% in disease and phenotype fields respectively. This work introduces a lightweight, easy-to-deploy, and yet robust LLM-based agent named GeneAgent, analyzes sources of disagreement, and highlights the feasibility of integrating powerful LLMs into genetic evidence synthesis workflows.
Highlights:
- First systematic benchmark of LLMs for extracting human gene-disease-phenotype relationships from full-text biomedical articles
- GeneAgent: a lightweight, highly accurate prompt-only LLM agent
- New domain task-specific evaluation framework

3
Automatic variant prioritization in suspected genetic kidney disease using the Nephro Candidate Score (N-CS)

Rank, N.; Lukassen, S.; Anderegg, M.; Eckardt, K.-U.; Halbritter, J. P.; Popp, B.

2025-09-30 nephrology 10.1101/2025.09.29.25336840 medRxiv
Top 0.1%
34.5%

Research Question: Despite the identification of >700 genes linked to rare and inherited kidney diseases (IKD), many individuals with presumed IKD do not receive a diagnosis through genetic testing of known disease genes. Therefore, the identification of new disease genes is crucial to ending diagnostic odysseys, improving genetic counseling, and expanding treatment options. While the generation of large-scale sequencing data is no longer a substantial bottleneck, its interpretation remains challenging and offers room for improvement, notably in the discovery of novel disease genes. Methods: We developed the Nephro Candidate Score (N-CS), a machine learning (ML) tool that prioritizes variants by combining a Nephro Gene Score (N-GS), a Nephro Variant Score (N-VS), and an Inheritance Score (IS). The ML-based N-GS and N-VS were trained on a wide range of genomic features to predict gene-disease relevance and variant pathogenicity, while the IS incorporates the mode of inheritance via a scoring heuristic. A Gene Set Enrichment Analysis (GSEA) was used to test whether genes top-ranked by the N-GS were enriched for kidney-related biological processes. Additionally, we tested the N-CS on an independent set of novel IKD candidate genes identified through a systematic literature search to validate its real-world performance. Results: The machine learning models for the N-CS subscores demonstrated high predictive accuracy, with an XGBoost algorithm for the N-GS achieving an AUC of 0.94 and a Logistic Regression model for the N-VS reaching an AUC of 0.99 in independent test sets. The biological relevance of the N-GS ranking was confirmed by the GSEA showing a significant enrichment of kidney-associated biological processes among top-scoring genes (p < 0.001). In the independent validation using recently published literature, the N-CS assigned compellingly high scores to the majority (10 of 11) of novel candidate genes for kidney disease, demonstrating its ability to generalize to new discoveries. Conclusion: The N-CS is a robust digital solution that can accelerate disease gene discovery and comes with the potential to reduce time to diagnosis. To support standardization and collaboration, the full N-CS framework is freely available, including a user-friendly web tool (NC-Scorer: https://nc-scorer.kidney-genetics.org/) and a command-line interface for high-throughput analysis, enabling standardized, sharable evaluation of candidate variants.

4
CoEVFold suite: user friendly pipelines to visually represent protein coevolution

Graham, C. L.; Cremona, L.; Little, R.; Rodrigues, C. D.

2026-01-26 microbiology 10.64898/2026.01.26.701017 medRxiv
Top 0.1%
34.2%

Multiple sequence alignment (MSA) data underlies current principles in protein folding and protein-protein interaction prediction, from which large language models (LLMs), in tandem with protein datasets, can predict protein structure. However, what is missing are user-friendly tools that enable researchers to predict and demonstrate coevolution - the principal input which these MSAs infer. Here we present tools to identify and visualize coevolution, through a pipeline (CoEVFold) that uses basic direct coupling algorithms derived from GREMLIN and alignment of sequences from MMseqs2. The pipeline generates a visual representation of coevolution for a single protein but can also represent coevolution of homomeric or heteromeric protein complexes, as well as coevolution within protein networks. The input for this pipeline can be an amino acid sequence, or user-supplied protein structures from AlphaFold, their own files, or the PDB database. In validation of CoEVFold's capabilities, and utilising proteins from known prokaryotic and eukaryotic model systems (Bacillus subtilis, Escherichia coli and Saccharomyces cerevisiae), as well as phage proteins, CoEVFold predicts coevolution between proteins known to interact, proteins known to oligomerise, and coevolution in proteins known to be part of a protein complex. Collectively, this suite of tools, named CoEVFold suite, has broad applicability and provides a useful toolkit to those interested in dissecting protein-protein interactions and networks. Availability: The code is available online at https://colab.research.google.com/drive/1MSSvNTq7KZ4Lr0XTz89vUuK-J3xOTzwS?usp=sharing and on GitHub (https://github.com/MishterBluesky/CoEVFold). Supplementary information: Supplementary data is available via Figshare and supplementary materials.

5
A RAG Chatbot for Precision Medicine of Multiple Myeloma

Quidwai, M. A.; Lagana, A.

2024-03-18 genetic and genomic medicine 10.1101/2024.03.14.24304293 medRxiv
Top 0.1%
34.2%

The advent of precision medicine has revolutionized cancer treatment by integrating individual genetic, lifestyle, and environmental factors to tailor patient care (Huang et al., 2020; Ginsburg and Phillips, 2018). However, the complexity and heterogeneity of diseases like Multiple Myeloma (MM) pose significant challenges in leveraging the vast amounts of genomic data and biomedical literature available for personalized treatment planning (Rajkumar, 2014; Rollig et al., 2015). To address this, we present an innovative Retrieval-Augmented Generation (RAG) based chatbot framework that harnesses the power of Natural Language Processing (NLP) and state-of-the-art language models to curate and analyze MM-specific literature and provide personalized treatment recommendations based on patient-specific genomic data (Lewis et al., 2020). Our framework integrates the BioMed-RoBERTa-base model for embedding generation (Gururangan et al., 2020) and the Mistral-7B language model for question answering (Anthropic, 2023), enabling effective understanding and response to complex clinical queries. The retrieval component is enhanced by Amazon OpenSearch Service, ensuring fast and accurate access to relevant information. A comprehensive data analysis pipeline, including exploratory data analysis, semantic search, clustering, and topic modeling, provides valuable insights into the MM research landscape, informing the chatbot's knowledge base and uncovering potential research directions (Blei et al., 2003; Mikolov et al., 2013). Deployed using Amazon Kendra, our RAG chatbot offers a user-friendly and scalable platform for accessing MM information, incorporating features such as user authentication, a customizable web interface, and continuous improvement based on user feedback. 
The framework aims to democratize access to precision medicine by providing clinicians with a sophisticated tool for interpreting complex genomic data in the context of MM, streamlining clinical workflows, and facilitating the development of personalized treatment plans (Patel et al., 2015). This paper presents the conceptualization, development, and potential impact of our RAG-based chatbot framework on the landscape of MM treatment and precision medicine. We argue that the synergistic integration of AI, NLP, and domain-specific knowledge marks a new era of healthcare, characterized by highly personalized, data-driven, and effective treatment modalities (Thong et al., 2021). Our framework not only advances the field of precision medicine in MM but also serves as a blueprint for the development of similar systems in other complex diseases, ultimately improving patient outcomes and quality of life.

6
Deep Integrated Network Analysis: a data-driven tool to discover and characterize disease pathways in the liver

Quin, J. E.; Urrutia Iturritza, M.; Mosquera, K. D.; Hildebrandt, F. F. A.; Barrenas, F.; Ankarklev, J.

2025-03-18 systems biology 10.1101/2025.03.17.643687 medRxiv
Top 0.1%
28.6%

Background & Aims: An extensive number of studies have utilized transcriptomic profiling as a valuable tool for uncovering genes related to diseases and physiological processes of the liver. Here we combine this wealth of information to provide a powerful resource, by computationally constructing a comprehensive and unbiased network of gene interactions specific to the liver. Methods: We have performed a computational approach termed Deep Integrated Network Analysis (DINA) on a curated catalog of 655 liver transcriptomic datasets (including a total of 48,311 transcriptomes). These datasets include human, monkey, mouse, rat and other mammalian species, and studies linked to a broad range of conditions. Together this facilitated construction of a network of strongly conserved gene-gene interactions relevant across the spectrum of liver diseases. Results: The Liver DINA Resource described herein contains 89,683 statistically conserved interactions among 19,317 genes in a unified network unique to the mammalian liver. The network unveils a hierarchical structure of strongly co-regulated modules, which are organized into a Tree-and-Leaf Network to provide a comprehensive overview of the resource. Conclusions: This data-driven resource provides an interactive, publicly available tool for the examination of previously undescribed gene networks, and enables unbiased analysis of transcriptomic datasets of the liver, thus preventing bias in favor of well-studied genes and pathways and providing a complementary approach towards novel discoveries.

7
Retrieval Augmented Protein Language Models for Protein Structure Prediction

Li, P.; Cheng, X.; Song, L.; Xing, E. P.

2024-12-05 bioinformatics 10.1101/2024.12.02.626519 medRxiv
Top 0.1%
28.4%

The advent of advanced artificial intelligence technology has significantly accelerated progress in protein structure prediction, with AlphaFold2 setting a new benchmark for prediction accuracy by leveraging the Evoformer module to automatically extract co-evolutionary information from multiple sequence alignments (MSA). To address AlphaFold2's dependence on MSA depth and quality, we propose two novel models: AIDO.RAGPLM and AIDO.RAGFold, pretrained modules for Retrieval-AuGmented protein language model and structure prediction in an AI-driven Digital Organism (Song et al., 2024). AIDO.RAGPLM integrates pre-trained protein language models with retrieved MSA, surpassing single-sequence protein language models in perplexity, contact prediction, and fitness prediction. When sufficient MSA is available, AIDO.RAGFold achieves TM-scores comparable to AlphaFold2 while operating up to eight times faster, and significantly outperforms AlphaFold2 when MSA is insufficient (ΔTM-score = 0.379, 0.116 and 0.059 for 0, 5 and 10 MSA sequences as input). Additionally, we developed an MSA retriever using hierarchical ID generation that is 45 to 90 times faster than traditional methods, expanding the MSA training set for AIDO.RAGPLM by 32%. Our findings suggest that AIDO.RAGPLM provides an efficient and accurate solution for protein structure prediction, particularly in scenarios with limited MSA data. The AIDO.RAGPLM model has been open-sourced and is available on https://huggingface.co/genbio-ai/AIDO.Protein-RAG-3B.

8
Y2H-SCORES: A statistical framework to infer protein-protein interactions from next-generation yeast-two-hybrid sequence data

Velasquez-Zapata, V.; Elmore, J. M.; Banerjee, S.; Dorman, K. S.; Wise, R. P.

2020-09-09 systems biology 10.1101/2020.09.08.288365 medRxiv
Top 0.1%
28.3%

Interactomes embody one of the most effective representations of cellular behavior by revealing function through protein associations. In order to build these models at the organism scale, high-throughput techniques are required to identify interacting pairs of proteins. Next-generation interaction screening (NGIS) protocols that combine yeast two-hybrid (Y2H) with deep sequencing are promising approaches to generate protein-protein interaction networks in any organism. However, challenges remain in mining reliable information from these screens, which limits their broader implementation. Here, we describe a statistical framework, designated Y2H-SCORES, for analyzing high-throughput Y2H screens that considers key aspects of experimental design, normalization, and controls. Three quantitative ranking scores were implemented to identify interacting partners, comprising: 1) significant enrichment under selection for positive interactions, 2) degree of interaction specificity among multi-bait comparisons, and 3) selection of in-frame interactors. Using simulation and an empirical dataset, we provide a quantitative assessment to predict interacting partners under a wide range of experimental scenarios, facilitating independent confirmation by one-to-one bait-prey tests. Simulation of Y2H-NGIS identified conditions that maximize detection of true interactors, which can be achieved with protocols such as prey library normalization, maintenance of larger culture volumes and replication of experimental treatments. Y2H-SCORES can be implemented in different yeast-based interaction screenings, accelerating the biological interpretation of experimental results. Proof-of-concept was demonstrated by discovery and validation of a novel interaction between the barley powdery mildew effector, AVRA13, and the vesicle-mediated thylakoid membrane biogenesis protein, HvTHF1. 
Author Summary: Organisms respond to their environment through networks of interacting proteins and other biomolecules. In order to investigate these interacting proteins, many in vitro and in vivo techniques have been used. Among these, yeast two-hybrid (Y2H) has been integrated with next generation sequencing (NGS) to approach protein-protein interactions on a genome-wide scale. The fusion of these two methods has been termed next-generation-interaction screening, abbreviated as Y2H-NGIS. However, the massive and diverse data sets resulting from this technology have presented unique challenges to analysis. To address these challenges, we optimized the computational and statistical evaluation of Y2H-NGIS to provide metrics to identify high-confidence interacting proteins under a variety of dataset scenarios. Our proposed framework can be extended to different yeast-based interaction settings, utilizing the general principles of enrichment, specificity, and in-frame prey selection to accurately assemble protein-protein interaction networks. Lastly, we showed how the pipeline works experimentally, by identifying and validating a novel interaction between the barley powdery mildew effector AVRA13 and the barley vesicle-mediated thylakoid membrane biogenesis protein, HvTHF1. Y2H-SCORES software is available at GitHub repository https://github.com/Wiselab2/Y2H-SCORES.

9
Uncovering biological patterns across studies through automated large-scale reanalyses of public transcriptomic data

Chen, K. G.; Lassmann, T.

2025-11-05 bioinformatics 10.1101/2025.11.04.686647 medRxiv
Top 0.1%
27.5%

Large amounts of transcriptomic data have been made available in public repositories. Systematic reanalyses of these data offer the potential to identify conserved biological patterns or context-specific signatures. However, this is a labour-intensive process requiring bioinformatic expertise and a long chain of manual decision making. Use of LLMs and agentic systems holds promise for automating these otherwise time-consuming tasks. Here, we present UORCA (Unified -Omics Reference Corpus of Analyses), a tool to systematically identify and analyse public transcriptomic datasets. UORCA uses an LLM-assisted framework to search for datasets relevant to a research question. These datasets are analysed through a multi-agent system that performs standardised bioinformatic analyses to identify differentially expressed genes. Results of each analysis are then displayed in an interactive visual interface. We found that UORCA recapitulated findings reported from a manual comparison of datasets, but also found biological signatures that were not initially described. We find that UORCA generates targeted hypotheses relevant for drug design, and facilitates evaluation of experimental results where they differ from past literature. Together, these findings demonstrate how UORCA accelerates biomedical discovery by enabling scientists to extract actionable findings from diverse public datasets.

10
Finding human gene-disease associations using a Network Enhanced Similarity Search (NESS) of multi-species heterogeneous functional genomics data

Reynolds, T.; Bubier, J. A.; Langston, M. A.; Chesler, E. J.; Baker, E. J.

2020-03-13 bioinformatics 10.1101/2020.03.11.987552 medRxiv
Top 0.1%
26.1%

Disease diagnosis and treatment is challenging in part due to the misalignment of diagnostic categories with the underlying biology of disease. The evaluation of large-scale genomic experimental datasets is a compelling approach to refining the classification of biological concepts, such as disease. Well-established approaches, some of which rely on information theory or network analysis, quantitatively assess relationships among biological entities using gene annotations, structured vocabularies, and curated data sources. However, the gene annotations used in these evaluations are often sparse, potentially biased due to uneven study and representation in the literature, and constrained to the single species from which they were derived. In order to overcome these deficiencies inherent in the structure and sparsity of these annotated datasets, we developed a novel Network Enhanced Similarity Search (NESS) tool which takes advantage of multi-species networks of heterogeneous data to bridge sparsely populated datasets. NESS employs a random walk with restart algorithm across harmonized multi-species data, effectively compensating for sparsely populated and noisy genomic studies. We further demonstrate that it is highly resistant to spurious or sparse datasets and generates significantly better recapitulation of ground truth biological pathways than other similarity metrics alone. Furthermore, since NESS has been deployed as an embedded tool in the GeneWeaver environment, it can rapidly take advantage of curated multi-species networks to provide informative assertions of relatedness of any pair of biological entities or concepts, e.g., gene-gene, gene-disease, or phenotype-disease associations. NESS ultimately enables multi-species analysis applications to leverage model organism data to overcome the challenge of data sparsity in the study of human disease. Availability and Implementation: Implementation available at https://geneweaver.org/ness. 
Source code freely available at https://github.com/treynr/ness. Author summary: Finding consensus among large-scale genomic datasets is an ongoing challenge in the biomedical sciences. Harmonizing and analyzing such data is important because it allows researchers to mitigate the idiosyncrasies of experimental systems, alleviate study biases, and augment sparse datasets. Additionally, it allows researchers to utilize animal model studies and cross-species experiments to better understand biological function in health and disease. Here we provide a tool for integrating and analyzing heterogeneous functional genomics data using a graph-based model. We show how this type of analysis can be used to identify similar relationships among biological entities such as genes, processes, and disease through shared genomic associations. Our results indicate this approach is effective at reducing biases caused by sparse and noisy datasets. We show how this type of analysis can be used to aid the classification of gene function and the prioritization of genes involved in substance use disorders. In addition, our analysis reveals genes and biological pathways with shared association to multiple, co-occurring substance use disorders.

11
Biologically-informed Interpretable Deep Learning Framework for Phenotype Prediction and Gene Interaction Detection

Hequet, C. C.; Gaggiotti, O.; Parachini, S.; Bochukova, E.; Ye, J.

2025-04-17 genomics 10.1101/2025.04.11.648325 medRxiv
Top 0.1%
25.9%

The detection of epistatic effects has significant potential to enhance understanding of the genetic basis of complex traits, but statistical epistatic analysis methods are complex and labour intensive. In recent years, Deep Neural Networks (DNNs) have emerged as a powerful tool for modelling arbitrarily complex genetic interactions in relation to a phenotype; however, their utility is often limited by the challenge of interpreting their predictive reasoning. Although DNN interpretation methods exist, they are typically not designed for genomic applications, leading to hard-to-understand outputs with limited relevance to the field. To address this gap, we introduce GENEPAIR - a novel DNN interpretation framework designed specifically for genomic data, aimed at detecting putative associated gene-gene interactions for a phenotype of interest. Our approach offers several key advantages including model agnosticity, robustness to sample- and variant-level data variance, and flexibility to integrate varied domain knowledge into interpretable features. We demonstrate the efficacy of our method by applying it to a DNN trained on genetic variant data to predict Body Mass Index (BMI). The results of the analysis not only reveal single gene influences in close alignment with literature but also uncover previously unreported gene-gene interactions, demonstrating its significant potential for genomic discovery. Author summary: Understanding how genes interact improves our understanding of how genetic pathways influence common diseases, potentially leading to new treatments. However, identifying these interactions is particularly challenging when a large number of genes are involved. Machine learning models, such as Deep Neural Networks (DNNs), excel at detecting complex patterns in data, but interpreting these patterns from trained networks remains a significant challenge. 
We have developed a novel framework to extract insights from a DNN trained to predict a trait, revealing how genes in the dataset may interact to influence the model's predictions. Our approach is easier to incorporate or adapt to biological prior knowledge than existing methods, offering a powerful tool for discovering previously unknown gene-gene interactions.

12
GeneFix-AI: AI-Powered CRISPR-Cas9 System for Real-Time Detection and Correction of Mutations in Non-Human Species

Ali, M.

2025-05-08 bioinformatics 10.1101/2025.05.04.652132 medRxiv
Top 0.1%
23.5%

The evolution of genome engineering technologies has transformed biomedical research, enabling precise and efficient modification of genetic material (Doudna and Charpentier, 2014). Among these, CRISPR-Cas9 stands out as a revolutionary gene-editing tool, though it often requires extensive expertise and technical knowledge (Cong et al., 2013; J. G. Doench et al., 2016). We propose GeneFix-AI, an Artificial Intelligence (AI)-driven platform for real-time prediction and correction of genetic mutations in non-human species. Developed using cutting-edge models inspired by recent advances at Harvard and Peking University (Chen et al., 2021; Wu et al., 2020), GeneFix-AI integrates machine learning to predict mutations, design optimal guide RNAs, and evaluate editing outcomes. This system aims to automate the CRISPR-Cas9 workflow, making high-precision gene editing more accessible to researchers without extensive molecular biology backgrounds (Liu et al., 2019). We present the system architecture, training methodology, and potential impact of GeneFix-AI in democratizing genome editing and accelerating discoveries in genetics.

13
Symbolic Regression for Mycophenolic Acid Dosage Prediction in Kidney Transplant Recipients

Senivarapu, S.; Ananthanarayanan, A.; Murari, A.; Hu, B. Z.; Sha, A. Z.

2025-08-19 nephrology 10.1101/2025.08.15.25333810 medRxiv
Top 0.1%
23.2%

Background: Chronic kidney disease (CKD) affects millions worldwide and often progresses to end-stage renal disease (ESRD), for which kidney transplantation remains the standard-of-care. Achieving optimal post-transplant immunosuppression, in particular precise mycophenolic acid (MPA) dosing, is critical for long-term graft survival. However, in practice, dose personalization remains difficult, especially in underserved and rural populations where access to transplant pharmacology expertise is limited. Methods: We developed an interpretable symbolic regression model trained on retrospective, multi-center kidney transplant datasets. Input variables included patient demographics, anthropometrics, initial MPA loading dose, primary immunosuppressant, and induction therapy regimen. For benchmarking, we also evaluated a suite of state-of-the-art machine learning models, including random forests and gradient-boosted trees. Results: The symbolic regression model identified clinically intuitive patterns: taller patients with lower initial MPA doses often require higher maintenance dosing; overly high starting doses are penalized; and dosing adjustments are strongly influenced by induction regimen parameters. While slightly less accurate than the best-performing black-box models with a mean-absolute error of 320 mg in dosage, the symbolic model maintains clinically acceptable error levels and offers full transparency of decision logic. Conclusions: Our explainable machine learning model delivers transparent, patient-specific MPA dosing recommendations that maintain clinically acceptable accuracy while revealing the underlying decision logic. By bridging the gap between complex pharmacologic modeling and point-of-care accessibility, this approach offers a viable pathway to improve post-transplant immunosuppression in settings where specialist support is scarce.

14
nleval: A Python Toolkit for Generating Benchmarking Datasets for Machine Learning with Biological Networks

Liu, R.; Krishnan, A.

2023-01-12 bioinformatics 10.1101/2023.01.10.523485 medRxiv
Top 0.1%
22.7%

Over the past decades, network biology has been a major driver of computational methods developed to better understand the functional roles of each gene in the human genome in their cellular context. Following the application of traditional semi-supervised and supervised machine learning (ML) techniques, the next wave of advances in network biology will come from leveraging graph neural networks (GNN). However, to test new GNN-based approaches, a systematic and comprehensive benchmarking resource that spans a diverse selection of biomedical networks and gene classification tasks is lacking. Here, we present the Open Biomedical Network Benchmark (OBNB), a collection of benchmarking datasets derived using networks from 15 sources and tasks that include predicting genes associated with a wide range of functions, traits, and diseases. The accompanying Python package, obnb, contains reusable modules that enable researchers to download source data from public databases or archived versions and set up ML-ready datasets that are compatible with popular GNN frameworks such as PyG and DGL. Our work lays the foundation for novel GNN applications in network biology. obnb will also help network biologists easily set up custom benchmarking datasets for answering new questions of interest and collaboratively engage with graph ML practitioners to enhance our understanding of the human genome. OBNB is released under the MIT license and is freely available on GitHub: https://github.com/krishnanlab/obnb

15
RatsPub: a webservice aided by deep learning to mine PubMed for addiction-related genes

Gunturkun, M. H.; Flashner, E.; Wang, T.; Mulligan, M. K.; Williams, R. W.; Prins, P.; Chen, H.

2020-11-05 bioinformatics 10.1101/2020.09.17.297358 medRxiv
Top 0.1%
22.7%

Interpreting and integrating results from omics studies typically requires a comprehensive and time-consuming survey of extant literature. Here, we introduce GeneCup, an easy-to-use literature mining web service that searches all PubMed abstracts for user-provided gene symbols in conjunction with a set of custom keywords organized into a customized ontology, as well as results from human genome-wide association studies (GWAS). As an example, we organized over 300 keywords related to drug addiction into seven categories. The literature search is conducted by querying the NIH PubMed server using a programming interface, which is followed by retrieving abstracts from a local copy of the PubMed archive. The main results presented to the user are individual sentences containing the gene symbol, organized by the keywords they also contain. These sentences are presented through an interactive graphical interface or as tables. GWAS results are displayed using a similar method. All results are linked to the original abstract in PubMed. In addition, a convolutional neural network is employed to distinguish sentences describing systemic stress from those describing cellular stress. The automated and comprehensive search strategy provided by GeneCup facilitates the integration of new discoveries from omics studies with existing literature. GeneCup is free and open source software. The source code of GeneCup and the link to a running instance are available at https://github.com/hakangunturkun/GeneCup
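
The sentence-level grouping described in this abstract can be illustrated with a minimal sketch. It is not GeneCup's code: the two-category ontology, the naive sentence splitter, and the function names below are all assumptions made for illustration, and the PubMed query and local archive lookup are left out entirely.

```python
import re
from collections import defaultdict

# Hypothetical mini-ontology (category -> keywords). The abstract describes
# over 300 keywords in seven categories; these two are placeholders.
ONTOLOGY = {
    "drug": ["cocaine", "nicotine"],
    "stress": ["stress"],
}


def split_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", abstract)
    return [p.strip() for p in parts if p.strip()]


def group_sentences(abstract, gene, ontology=ONTOLOGY):
    """Return {(category, keyword): [sentences]} for sentences naming the gene."""
    gene_re = re.compile(r"\b" + re.escape(gene) + r"\b", re.IGNORECASE)
    grouped = defaultdict(list)
    for sentence in split_sentences(abstract):
        if not gene_re.search(sentence):
            continue  # keep only sentences that contain the gene symbol
        lowered = sentence.lower()
        for category, keywords in ontology.items():
            for keyword in keywords:
                if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):
                    grouped[(category, keyword)].append(sentence)
    return dict(grouped)


if __name__ == "__main__":
    text = (
        "Drd2 expression changes after cocaine exposure. "
        "Stress alters Bdnf levels. "
        "Unrelated sentence about cocaine."
    )
    print(group_sentences(text, "Drd2"))
```

Whole-word matching is used here so that a short gene symbol does not match inside longer words; a production system would also need to handle gene aliases and ambiguous symbols, which this sketch ignores.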

16
AIChatBio: An Artificial Intelligence Chatbot Model for Biological Knowledge Retrieval and Biomacromolecule Design

Liu, E.; Liu, C.-Y.

2025-09-17 bioinformatics 10.1101/2025.09.11.675485 medRxiv
Top 0.1%
22.6%

Conversational agents for bioinformatics data analysis and interpretation remain largely inaccessible to the broader biological research community. This gap is especially pronounced in the current Generative AI era, which demands a paradigm shift in how researchers interact with computational tools. There is a pressing need to bridge well-established biological infrastructures and databases with the capabilities of Generative AI to democratize access to bioinformatics insights. In this study, we present an integrated framework that connects the robust bioinformatics resources of the National Center for Biotechnology Information (NCBI) with Generative AI through a novel Artificial Intelligence Chatbot Model for Biological Knowledge Retrieval and Biomacromolecule Design, AIChatBio. This operational model positions Generative AI as an intelligent information hub. User interactions with the chatbot are enriched by real-time data retrieval from the web portals of the biological databases hosted at NCBI, and are then translated into structured inquiries toward the web applications of NCBI and bioinformatics analysis tools. These inquiries are directed toward bioinformatics analysis tools to perform tasks such as sequence alignment and primer design. Additionally, the outputs generated by these tools are interpreted by the chatbot, allowing users to gain meaningful insights without requiring deep technical expertise in bioinformatics. To demonstrate the feasibility of this approach, we developed a prototype implementation that integrates PCR primer design using Primer-BLAST [1], literature interpretation via PubMed for general topics, and LitVar2 for SNP-associated topics [23]. This system was built using TypeScript and the ChatGPT API, combined with the bioinformatics web applications from NCBI; its source code is publicly available via GitHub, and the Chrome extension is available at the Chrome Web Store.
Our work highlights the potential of Generative AI to transform biological data analysis workflows, making them more intuitive, accessible, and scalable for researchers across disciplines.

17
Has AlphaFold 3 Solved the Protein Folding Problem for D-Peptides?

Childs, H.; Zhou, P.; Donald, B. R.

2025-03-17 bioinformatics 10.1101/2025.03.14.643307 medRxiv
Top 0.1%
22.3%

Due to the favorable chemical properties of mirrored chiral centers (such as improved stability, bioavailability, and membrane permeability), the computational design of D-peptides targeting biological L-proteins is a valuable area of research. To design these structures in silico, a computational workflow should correctly dock and fold a peptide while maintaining chiral centers. The latest AlphaFold 3 (AF3) from Abramson et al. (2024) enforces a strict chiral violation penalty to maintain chiral centers from model inputs and is reported to have a low chiral violation rate of only 4.4% on a PoseBusters benchmark containing diverse chiral molecules. Herein, we report the results of 3,255 experiments with AF3 to evaluate its ability to predict the fold, chirality, and binding pose of D-peptides in heterochiral complexes. Despite our inputs specifying explicit D-stereocenters, we report that the AF3 chiral violation rate for D-peptide binders is much higher at 51% across all evaluated predictions; on average the model is as accurate as chance (random chirality choice, L or D, for each peptide residue). Increasing the number of seeds failed to improve this violation rate. The AF3 predictions exhibit incorrect folds and binding poses, with D-peptides commonly oriented incorrectly in the L-protein binding pocket. Confidence metrics returned by AF3 also fail to distinguish predictions with low chirality violation and correct docking vs. predictions with high chirality violation and incorrect docking. We conclude that AF3 is a poor predictor of D-peptide chirality, fold, and binding pose. Finally, we propose solutions to improve this model.
Significance Statement: AlphaFold 3 (AF3) is a model trained to predict protein interactions. This algorithm is tuned to respect chiral centers (L and D). Changing the chirality of even one protein residue can significantly alter chemical properties such as binding and stability. Therefore, an algorithm should exhibit a chiral center error rate of 0%. Although the original AF3 authors reported a 4.4% chirality violation rate, we have found that the rate for D-peptides is much higher at ~50%. Our data highlight a crucial structural prediction error in AF3 and demonstrate that this widely used model is as accurate on average as chance (random chirality choice, L or D, for each peptide residue). These results indicate that structure prediction of D-peptides is an outstanding problem.
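
To make the "as accurate as chance" comparison concrete, here is a minimal sketch of a per-residue chiral violation rate. It is an illustration, not the authors' evaluation code: it assumes each residue's handedness has already been reduced to a label ("L", "D", or anything else for achiral residues such as glycine), and the function name is hypothetical. How those labels are extracted from predicted coordinates is outside the scope of the sketch.

```python
def chiral_violation_rate(expected, predicted):
    """Fraction of chiral residues whose predicted handedness differs from the input.

    expected, predicted: equal-length sequences of labels, "L" or "D" for
    chiral residues; any other label in `expected` (e.g. "G" for glycine)
    marks an achiral residue and is skipped.
    """
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    pairs = [(e, p) for e, p in zip(expected, predicted) if e in ("L", "D")]
    if not pairs:
        raise ValueError("no chiral residues to evaluate")
    violations = sum(1 for e, p in pairs if e != p)
    return violations / len(pairs)


if __name__ == "__main__":
    # An all-D peptide where two of four residues were flipped to L.
    print(chiral_violation_rate("DDDD", "DLDL"))
```

If each residue's handedness were instead chosen uniformly at random between L and D, the expected value of this rate would be 0.5, which is the chance baseline the abstract compares against.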

18
MOFA: Multi-Objective Flux Analysis for the COBRA Toolbox

Griesemer, M.; Navid, A.

2021-05-22 systems biology 10.1101/2021.05.20.445041 medRxiv
Top 0.1%
22.3%

Multi-objective Optimization (MO) is an important tool for quantitative examination of the trade-offs faced by biological organisms. Using genome-scale constraint-based models of metabolism (GSMs), Multi-Objective Flux Analysis (MOFA) allows MO analyses of trade-offs among key biological tasks. The leading software package for conducting a plethora of different types of constraint-based analyses using GSMs is the COBRA Toolbox for MATLAB. We have developed a new add-on tool for this toolbox using Normalized Normal Constraint (NNC) that performs MOFA for a number of objectives limited only by computation power (n ≤ 10). This development will facilitate MOFA analyses by COBRA's large user base and allow greater multi-faceted examination of metabolic trade-offs in complicated biological systems. Availability and Implementation: The MOFA software is freely available for download from https://bbs.llnl.gov under the GPL v2 license. The program runs on MATLAB with the COBRA software on Windows, Linux, and MacOS. It includes a detailed manual explaining the input and output of a simulation, a listing of the code's functions, and an example MOFA run using a well-curated GSM model of E. coli. Contact: griesemer1@llnl.gov or navid1@llnl.gov

19
jFuzzyMachine - An Open-Source Fuzzy Logic-based Regulatory Inference Engine for High Throughput Biological Data

Aiyetan, P.

2020-10-08 systems biology 10.1101/2020.10.06.315994 medRxiv
Top 0.1%
22.2%

Elucidating mechanistic relationships between and among intracellular macromolecules is fundamental to understanding the molecular basis of normal and diseased processes. Here, we introduce jFuzzyMachine - a fuzzy logic-based regulatory network inference engine for high-throughput biological data. We describe its design and implementation. We demonstrate its functions on a sampled expression profile of the vorinostat-resistant HCT116 cell line. We compared jFuzzyMachine's inferred regulatory network to that inferred by the ARACNe (an Algorithm for the Reconstruction of Gene Regulatory Networks) tool. Potentially more sensitive, jFuzzyMachine showed a slight increase in identified regulatory edges compared to ARACNe. A significant overlap was also observed in the identified edges between the two inference methods. Over 70 percent of edges identified by ARACNe were identified by jFuzzyMachine. Beyond identifying edges, jFuzzyMachine shows direction of interactions, including bidirectional interactions - specifying regulatory inputs and outputs of inferred relationships. jFuzzyMachine addresses an apparent lack of a freely available community tool implementing a fuzzy logic regulatory network inference method - mitigating a limitation to applying and extending benefits of the fuzzy inference system to understanding biological data. jFuzzyMachine's source code and precompiled binaries are freely available at the GitHub repository locations: https://github.com/paiyetan/jfuzzymachine and https://github.com/paiyetan/jfuzzymachine/releases/tag/v1.7.21.
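
The edge-overlap figure quoted in this abstract (the share of one tool's edges also found by the other) can be computed along the following lines. This is a hedged sketch rather than the author's procedure: it assumes edges are given as pairs of gene names, that direction is ignored when comparing a directed network against an undirected one, and the function names and example genes are hypothetical.

```python
def undirected(edges):
    """Collapse (source, target) pairs into direction-free edges."""
    return {frozenset(edge) for edge in edges}


def recovered_fraction(reference_edges, candidate_edges):
    """Fraction of reference edges that also appear among the candidate edges.

    Direction is ignored, so a candidate edge B -> A counts as recovering
    the reference edge A - B.
    """
    reference = undirected(reference_edges)
    candidate = undirected(candidate_edges)
    if not reference:
        raise ValueError("reference network has no edges")
    return len(reference & candidate) / len(reference)


if __name__ == "__main__":
    # Hypothetical toy networks; gene names are placeholders.
    reference = [("A", "B"), ("B", "C"), ("C", "D")]
    candidate = [("B", "A"), ("C", "B"), ("A", "D"), ("D", "A")]
    print(recovered_fraction(reference, candidate))
```

Collapsing direction means a bidirectional pair in the candidate network counts once, so the measure reflects which gene pairs are linked rather than how many directed arrows were inferred.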

20
LazyAF, a pipeline for accessible medium-scale in silico prediction of protein-protein interactions

McLean, T. C.

2024-02-01 bioinformatics Community evaluation 10.1101/2024.01.29.577767 medRxiv
Top 0.1%
22.2%

Artificial intelligence has revolutionized the field of protein structure prediction. However, with more powerful and complex software being developed, it is accessibility and ease of use rather than capability that is quickly becoming a limiting factor to end users. Here, I present a Google Colaboratory-based pipeline, named LazyAF, which integrates the existing ColabFold BATCH to streamline the process of medium-scale protein-protein interaction prediction. I apply LazyAF to predict the interactome of the 76 proteins encoded on a broad-host-range multi-drug resistance plasmid RK2, demonstrating the ease and accessibility the pipeline provides. Availability: LazyAF is freely available at https://github.com/ThomasCMcLean/LazyAF